Sentiment analysis is the process of computationally recognizing and
classifying the attitude conveyed in a text towards a particular topic or
product as either positive or negative. It is one of the interesting
applications of natural language processing and is widely used to analyze
social media. Text in social media is casual and can be written either as
code-switch or as monolingual text. Several researchers have implemented
sentiment analysis on monolingual text, though sentiments can also be
expressed in code-switch text. Sentiment analysis can be applied through
deep learning, machine learning, or a Lexicon-based approach. Machine
learning and deep learning methods are time-consuming, computationally
expensive, and need training data for analysis. The Lexicon-based method
does not require training data and requires less time to find the sentiments
in comparison with machine learning and deep learning. In this paper, we
propose a Lexicon-based approach (NBLex) to analyze the sentiments
expressed in Kannada-English code-switch text. This is the first effort that
aims to perform sentiment analysis on Kannada-English code-switch text
using a Lexicon-based approach. The proposed approach achieved a better
Accuracy (83.2%) and F1-score (83%) than the machine learning and deep
learning models it was compared with.
1. INTRODUCTION

Social media websites such as Twitter, YouTube and Facebook are open for online users to convey
their sentiments about movies, products and services every day. Sentiment analysis is the process of
detecting sentiments such as positive and negative from text content. Plutchik [1] proposed “A
Psychoevolutionary Theory of Emotions”, with two sentiment classes, i.e., Positive and Negative, and eight
elementary emotions, i.e., Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, and Trust.

In India, most internet users are multilingual or bilingual. According to a Times of India article
(Nov 7, 2018), 52% of Indians are bilingual and 18% are multilingual. This multilingualism allows users
to use different vernaculars in social media communication. The ability to switch between languages is termed
code-switching or code-mixing [2]. Code-mixing is a common phenomenon on social media [3]-[5].

Both monolingual and code-switch text are used in social media communication. To demonstrate, Ex1
is a monolingual text, where the script and the source language are the same (English), and Ex2
is a code-switch text, where the script and the source language are different (the script is
English and the source language is Kannada). Sentiments can be conveyed through code-switch text or
monolingual text. In Ex1, the user conveys Positive sentiment and in Ex2, the user conveys Negative
sentiment. Internet users are comfortable communicating on social media with code-switch text since it
exempts them from grammatical and linguistic rules.

Ex1: “Abdul kalam is the always best & brilliant President of India” 

Ex2: “Avrgi enu arta haguthe, bidi guru” 

Translation: “What can they understand, leave it, Sir” 

Code-switch text is rich in out-of-vocabulary (OOV) words, which are very common in social media text [6]. Some
researchers implement sentiment analysis on code-switch text by using machine translation
(translating into the English language). This translation process requires considerable effort to improve
performance. Machine learning and deep learning approaches struggle to handle OOV
words, are computationally expensive, and need more training data. Due to these limitations, the
Lexicon-based approach is used for sentiment analysis instead of machine translation, machine learning, and
deep learning techniques.

The Lexicon-based method uses a list of pre-labeled words which are categorized into Positive and
Negative sentiments [7]. Lexicon-based methods are of two types, namely corpus-based and
dictionary-based. In the corpus-based approach, Lexicons are constructed based upon the statistics (co-occurrence) of a
word or the semantics of a word. In the dictionary-based method, a limited set of words (seed words) is first collected
with sentiment labels. The next step is to use a bootstrapping technique to collect all synonyms of the seed words
from a dictionary and add these synonyms to the seed list. Some of the well-known Lexicons are Affective
Norms for English Words (ANEW) [8], WordNet-Affect Lexicon [9], and NRC Word-Emotion Association
Lexicon [10].
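
The bootstrapping step of the dictionary-based method can be sketched as follows. This is only a minimal illustration: the synonym dictionary `SYNONYMS`, the seed words and the function name `expand_seeds` are hypothetical and are not taken from any of the cited Lexicons.

```python
from collections import deque

# Hypothetical toy synonym dictionary; a real system would query a lexical resource.
SYNONYMS = {
    "good": ["great", "fine"],
    "great": ["excellent"],
    "bad": ["poor", "awful"],
    "poor": ["inferior"],
}

def expand_seeds(seeds, synonyms):
    """Propagate each seed word's label to its synonyms, breadth-first."""
    lexicon = dict(seeds)            # word -> sentiment label
    queue = deque(seeds)             # words whose synonyms still have to be visited
    while queue:
        word = queue.popleft()
        for syn in synonyms.get(word, []):
            if syn not in lexicon:   # keep the first label assigned to a word
                lexicon[syn] = lexicon[word]
                queue.append(syn)
    return lexicon

seed_words = {"good": "Positive", "bad": "Negative"}
lexicon = expand_seeds(seed_words, SYNONYMS)
print(lexicon)
```

A word that is absent from the dictionary, such as a romanized Kannada word, is never reached by this expansion, so it receives no label.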

The dictionary-based method is not well suited to code-switch text since it has a restricted
vocabulary, whereas social media text contains many OOV words. The corpus-based method can handle this drawback since
its vocabulary has no boundary. This study therefore proposes a corpus-based Lexicon method to analyze
sentiment in Kannada-English code-switch text.

Palanisamy et al. [11] built two types of Lexicons for sentiment analysis (a category-specific
Lexicon and a common Lexicon) by using serendio-taxonomy, which has a collection of positive words, negative words, stop
words, phrases, and negations. Adding word sense disambiguation improves the performance of the proposed
approach. Guan et al. [12] proposed two steps for generating a Lexicon in the Chinese language. In the first
step, they trained one-word vectors from distributed information, and in the second step, they used renovated word
vectors with a similarity-based approach. Desai et al. [13] proposed a technique that adds shallow parsing
to a sentiment Lexicon. The advantage of this technique is that it is very accurate and automatically
creates a structure that avoids the cost of manually tagging data. However, it does not permit the inside
arrangement of the basic word, or specifying a value in a sentence.

Jin et al. [14] proposed a technique that adds user-based features to global Lexicon features to
perform sentiment analysis on short social media texts. Further study on optimal combination methods and
feature fusing strategies could improve the accuracy of the model. Park et al. [15] used a dictionary-based
approach to build a thesaurus for sentiment analysis by using three online dictionaries (co-occurrence words,
a list of synonyms-antonyms, and seed words). The proposed approach takes more time since it uses three
dictionaries, and informal words are occasionally not considered. Xiong et al. [16]
developed a domain-specific sentiment Lexicon by identifying the correlation among sentiment words (global
information, local information, and constraint information), and it showed good adaptability with a
semi-supervised approach.

Vu et al. [17] introduced a Lexicon method for sentiment analysis. It is a new and effective method
since it combines the most popular Lexicon methods such as SentiWN and LIU. Ashna et al. [18]
developed a dictionary-based sentiment Lexicon for Malayalam movie reviews. This method reached 90%
accuracy at the document level and 87.5% at the sentence level. One limitation of this approach is its
handling of phrases and idioms, which arises because only unigrams are used. This limitation can be overcome
by using bigrams and trigrams. Awwad et al. [19] suggested a hybrid stemming method to enhance
lexicon-based sentiment analysis (MPQA IIT, MPQA, HRMA, and HarvadA) at both the document level and sentence
level. Increasing the number of stemmers can help to improve the accuracy of the model.

Rezapour et al. [20] analyzed the effect of combining manually annotated hashtags with sentiment 
Lexicon and achieved a 7% improvement in accuracy. POS tagging can help improve the accuracy of this 
method. Yadav et al. [21] studied the importance of a domain-specific Lexicon for sentiment analysis. They 
built a domain-specific lexicon by introducing a bigram algorithm with a proposed strategy for developing 
some new corpora. Mowlaei et al. [22] developed a Lexicon using a genetic algorithm that fits
aspect-level problems. They used two Lexicons: the first is an intensity Lexicon (NRC Hashtag, AFINN,
and Sentiment 140) and the second is a polarity Lexicon (Liu opinion Lexicon). However, the stemming
and ordering of phrases are not considered while generating the Lexicon.


Lexicon-based sentiment analysis for Kannada-English code-switch text (Ramesh Chundi) 


1502 O ISSN: 2252-8938 


Agarwal et al. [23] used the NRC emotion Lexicon to predict how fans’ sentiment changes over 
time. Sohangir et al. [24] compared Lexicon methods (VADER, SentiWordNet) with machine learning 
methods (Naive Bayes, support vector machine (SVM), logistic regression) and found that the Lexicon 
method is faster than machine learning methods. Yuan et al. [25] suggested a method for the Chinese 
sentiment Lexicon using the Word2Vec tool and studied the sentiment words. Increasing the corpus size
helps improve accuracy. Chathuranga et al. [26] generated a sentiment Lexicon for sentiment analysis in the
Sinhala language using a semi-automated, corpus-based approach.

Chang et al. [27] proposed a method using the skip-gram variant for mapping word spaces and
generated language Lexicons with a smaller number of resources. Further improvement is possible by
increasing the vocabulary size and the accuracy of the Lexicons. Taj et al. [28] suggested a dictionary-based Lexicon
approach (WordNet Lexicon dictionary) for sentiment analysis of BBC news articles. However, the proposed
approach has limited word coverage. Yin et al. [29] built an automatic sentiment Lexicon (FCP-Lex) using
CPchunks, reduced the ambiguity of words, and obtained high-quality corpora. However, further study could
concentrate on Chinese natural language processing problems such as word embedding, semantic disambiguation, and word
segmentation. Abd et al. [30] proposed a Lexicon-based sentiment analysis system on the
IMDb movie review dataset. The proposed system showed better accuracy that did not fluctuate even when the size of the
dictionary was altered. Sallam et al. [31] proposed a collaborative filtering system based on sentiment analysis on an
Arabic book dataset to provide recommendations. The proposed system reduced the average error values in
terms of root mean squared error (RMSE) and mean absolute error (MAE).

Pamungkas et al. [32] performed sentiment analysis (SentiWordNet) by translating Bahasa
Indonesia code-switch data into English and achieved an accuracy of 68%. Translating from one language
to another has some limitations, such as variation of slang, non-standard language, ambiguity,
and the thwarted expectation phenomenon. Karamollaoglu et al. [33] performed sentiment analysis on Turkish
messages by extracting equivalent English words for Turkish words using an online dictionary. This process
leads to misinterpretation and ambiguity of the sentiment conveyed in sentences. Rodzman et al. [34]
demonstrated that Lexicon-based methods resolve the limitations of machine learning techniques for
sentiment analysis in code-switch data. Lexicon-based methods give more accurate results in comparison
with Naive Bayes for domain-specific Malay documents. However, more data in the dictionary and
phrase-level relations are still needed for optimal results.

Pratama et al. [35] implemented various combinations of Lexicon resources with machine
translations for sentiment analysis. The combination of Google Translator with SentiWordNet gave the highest
accuracy (72%). The proposed method must be tested on a large dataset to determine whether the classifier delivers
superior or inferior performance. Tsamis et al. [36] performed Lexicon-based
sentiment analysis on bilingual (English and Greek) text. They used the MPQA Lexicon and a Greek
sentiment Lexicon to find the sentiment. Based on the above literature study, no research has been done
on Lexicon-based sentiment analysis of Kannada-English code-switch data till now. This is the first study to
implement sentiment analysis on Kannada-English code-switch data by using the Lexicon approach. A
Kannada-English code-switch sentiment Lexicon (NBLex) has been generated.


2. METHODOLOGY 
2.1. Lexicon generation 

In this section, the focus is on how the Kannada-English code-switch corpus is created and
annotated and on the NBLex Lexicon construction process. Here, the study considers a text as code-switch even if only one
word differs from the monolingual condition. The study ensures that all the comments in the corpus are
written in English script, as shown in Ex2. Since the goal is to perform sentiment analysis on
Kannada-English code-switch text, comments that are not primarily code-switch in nature are not considered.

Initially, the researchers gathered 7194 Kannada-English code-switch comments from
YouTube.com covering different areas like social events, movie reviews, celebrities and politics. A
pre-processing task is carried out to eliminate noisy data like special characters, symbols and digits, as they do not
convey any sentiment polarity and therefore have no significance in creating the Lexicon. Manual
labeling of sentiment (Positive and Negative) is carried out as we do not have any automated tagging
technique for Kannada-English code-switch text.
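
The pre-processing step described above might look like the following sketch. The function names `preprocess` and `tokenize` are hypothetical, and lowercasing and whitespace tokenization are assumptions that the text does not state; only the removal of special characters, symbols and digits comes from the description.

```python
import re

def preprocess(comment):
    """Remove special characters, symbols and digits, then normalise whitespace."""
    letters_only = re.sub(r"[^A-Za-z\s]", " ", comment)  # keep Latin letters and whitespace only
    # Lowercasing is an assumption, not a step stated in the paper.
    return re.sub(r"\s+", " ", letters_only).strip().lower()

def tokenize(comment):
    """Split a cleaned comment into word tokens on whitespace (assumed tokenizer)."""
    return preprocess(comment).split()

print(tokenize("Avrgi enu arta haguthe, bidi guru!!! 100%"))
```

Because every comment in the corpus is written in English script, keeping only Latin letters is enough for this illustration.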


Algorithm 1: Building Frequency Dictionary
Freq_Dicti (D, T, L)
Input:
    D - Dictionary with keys (word, label tuple) and frequency.
    T - List of comments.
    L - List of sentiment labels (1 or 0), one for each comment.
Output:
    D - Dictionary with keys and frequency.
for t, i in (T, L)
    for w in t
        p = (w, i)
        if p in D
            D[p] += 1
        else
            D[p] = 1
return D
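
Algorithm 1 can be sketched in Python as shown below. The function name `build_freq_dict` and the toy comments are hypothetical, and the loop over (T, L) is read here as a pairwise iteration over comments and their labels, which is an interpretation of the pseudocode.

```python
def build_freq_dict(comments, labels, freqs=None):
    """Count how often each (word, label) pair occurs in the labelled comments."""
    freqs = {} if freqs is None else freqs
    for tokens, label in zip(comments, labels):   # pair each comment with its label
        for word in tokens:
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1  # start at 0 for an unseen pair
    return freqs

# Hypothetical tokenised comments with labels (1 = Positive, 0 = Negative).
comments = [["super", "movie", "guru"], ["waste", "movie"], ["super", "acting"]]
labels = [1, 0, 1]
freqs = build_freq_dict(comments, labels)
print(freqs)
```

The same word can appear under both labels, as "movie" does here, which is why the key is the (word, label) tuple and not the word alone.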


Building the frequency dictionary is the first step in Lexicon creation, where the keys are
(word, label) tuples and the values are frequencies. The label value is either 1 or 0 and the frequency is an integer value.
Algorithm 1 depicts the complete steps for building the frequency dictionary. Further, we compute the positive
and negative frequency of each word. In the next step, we calculate the total number of Positive words, the total number of
Negative words, and the total number of words in the corpus. Algorithm 2 depicts the complete steps for
NBLex Lexicon creation. We then compute the Positive and Negative probability of each word. The positive probability
is computed using (1), where pp is the positive probability, pf is the positive frequency, Np is the total number of positive
words, and N is the total number of words (T in Algorithm 2, where it is computed as the number of distinct words).
The negative probability of the word is calculated using (2).


$$pp = \frac{pf + 1}{Np + N} \tag{1}$$

$$np = \frac{nf + 1}{Nn + N} \tag{2}$$


Here, np is the negative probability, nf is the negative frequency, and Nn is the total number of negative words. In
the end, we calculate the score for every word using (3), where S is the score of the word (any real number).


$$S = \log\left(\frac{pp}{np}\right) \tag{3}$$
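
As a worked illustration of equations (1) to (3), the sketch below scores two words from made-up counts. The function name `word_score` and all the counts are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated; a different base would rescale the scores without changing their signs.

```python
import math

def word_score(pf, nf, n_pos, n_neg, n_words):
    """Score of one word following equations (1) to (3), with the natural logarithm."""
    pp = (pf + 1) / (n_pos + n_words)    # equation (1)
    np_ = (nf + 1) / (n_neg + n_words)   # equation (2)
    return math.log(pp / np_)            # equation (3)

# Hypothetical corpus totals: Np = 20, Nn = 25, N = 10.
mostly_positive = word_score(pf=4, nf=1, n_pos=20, n_neg=25, n_words=10)
only_negative = word_score(pf=0, nf=3, n_pos=20, n_neg=25, n_words=10)
print(round(mostly_positive, 3), round(only_negative, 3))
```

The +1 in each numerator keeps both probabilities above zero, so the logarithm stays finite even for a word that never occurs with one of the two labels.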


Algorithm 2: NBLex Lexicon Construction
Nblex_score (D, T, L)
Input:
    D - Dictionary with keys (word, label tuple) and frequency.
    T - List of comments.
    L - Sentiment label (1 or 0).
Output:
    S - Real value for each word in the dictionary.

/* Compute T (total number of words) */
V = set(p[0] for p in D.keys())
T = len(V)
/* Compute total number of positive and negative words */
Np = Nn = 0
for p in D.keys()
    if p[1] > 0 then
        Np += D[p]
    else
        Nn += D[p]
for w in V
    /* Compute Positive and Negative frequencies */
    pf = lookup(D, w, 1)
    nf = lookup(D, w, 0)
    /* Compute Positive and Negative probability */
    pwp = (pf + 1) / (Np + T)
    pwn = (nf + 1) / (Nn + T)
    /* Compute score */
    S[w] = log(pwp) - log(pwn)
return S
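
A runnable sketch of Algorithm 2 is given below. The function name `nblex_scores` and the toy frequency dictionary are hypothetical; `lookup` is assumed to return 0 for a (word, label) pair that is missing from D, and the natural logarithm is assumed.

```python
import math

def nblex_scores(freqs):
    """Turn a {(word, label): count} dictionary into a {word: score} Lexicon."""
    vocab = {word for word, _ in freqs}
    n_words = len(vocab)                        # T in Algorithm 2
    n_pos = sum(c for (_, label), c in freqs.items() if label > 0)
    n_neg = sum(c for (_, label), c in freqs.items() if label <= 0)
    scores = {}
    for word in vocab:
        pf = freqs.get((word, 1), 0)            # lookup(D, w, 1), assumed 0 if absent
        nf = freqs.get((word, 0), 0)            # lookup(D, w, 0), assumed 0 if absent
        pwp = (pf + 1) / (n_pos + n_words)
        pwn = (nf + 1) / (n_neg + n_words)
        scores[word] = math.log(pwp) - math.log(pwn)
    return scores

# Hypothetical frequency dictionary (1 = Positive, 0 = Negative).
freqs = {("super", 1): 2, ("movie", 1): 1, ("movie", 0): 1, ("waste", 0): 1}
lexicon = nblex_scores(freqs)
for word, score in sorted(lexicon.items()):
    print(f"{word}: {score:+.3f}")
```

In this toy example "movie" occurs once under each label but still receives a slightly negative score, because the Negative class holds fewer words in total and each of its occurrences therefore weighs more.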


2.2. Sentiment prediction 

The proposed model is tested on a total of 1,799 comments gathered from YouTube.com
across different domains like movie reviews, politics, celebrations and social events, not restricted to any
one specific domain. During the pre-processing task, noisy data such as special characters, symbols,
and digits are removed to increase the accuracy of sentiment analysis. Sentiment annotation (Positive and Negative) has
been done by linguistic experts, as discussed in the previous section, since we do not have any automated
labeling system for Kannada-English code-switch text. Figure 1 shows the steps of the sentiment analysis
process on Kannada-English code-switch text. Once the Lexicon is developed, the score for each comment in
the test dataset is calculated by summing the scores of its words using (4).

$$ST = \sum_{w} S(w) \tag{4}$$

Here, ST refers to the sentence score and S(w) is the value/score of each word w in the comment. Once the scores are
calculated, the sentiment prediction task is carried out using (5). If the score of a comment is greater than 0, then the
sentiment predicted is positive, otherwise it is negative. Algorithm 3 shows the entire procedure for the sentiment
prediction task using the NBLex Lexicon.


$$\text{Sentiment} = \begin{cases} 1 \ (\text{positive}), & \text{if } ST > 0 \\ 0 \ (\text{negative}), & \text{otherwise} \end{cases} \tag{5}$$


Algorithm 3: Sentiment Analysis
Input: text, t      /* text - Input data
                       t - Collection of tokens with scores */
Output: Sentiment   /* Positive or Negative */
/* Calculating score of text (Text comment) */
Sent_Score (text, t)
{
    ST = 0                  /* ST is the variable to record the score of text */
    for i in text           /* add the score of each word in the text */
        if i in t then
            ST += t.get(i)
    return ST
}
/* Finding the Sentiment of test_data */
Sent_Analysis (test_data)
{
    P = []                  /* empty list */
    for k in test_data      /* compute the score for each comment in the test_data */
        if Sent_Score(k, S) > 0 then    /* S is the NBLex Lexicon from Algorithm 2 */
            Ps = 1 (Positive)
        else
            Ps = 0 (Negative)
        P.append(Ps)
    return P
}
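
Algorithm 3 can be sketched in Python as follows. The function names, the Lexicon scores and the test comments are hypothetical; the tokens "hosa" and "pada" only stand in for words that are missing from the Lexicon.

```python
def sentence_score(tokens, lexicon):
    """Sum the Lexicon scores of the tokens; unknown tokens contribute nothing."""
    return sum(lexicon.get(token, 0.0) for token in tokens)

def predict_sentiments(test_data, lexicon):
    """Label each tokenised comment 1 (Positive) if its score is above 0, else 0 (Negative)."""
    return [1 if sentence_score(tokens, lexicon) > 0 else 0 for tokens in test_data]

# Hypothetical Lexicon scores and tokenised test comments.
lexicon = {"super": 0.9, "waste": -0.9, "movie": -0.2}
test_data = [["super", "movie"], ["waste", "movie"], ["hosa", "pada"]]
print(predict_sentiments(test_data, lexicon))
```

A comment made up only of unknown words scores exactly 0 and is therefore labelled Negative under rule (5).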


[Figure 1: flowchart with the steps Input (Preprocessed Data), Tokenization, NBLex, Calculating Sentiment Score, Sentiment Prediction, and the outcomes Positive / Negative]

Figure 1. Sentiment analysis process


3. RESULTS AND DISCUSSION 

In this section, the study compares the results of the Lexicon-based approach with other models such
as Naive Bayes (machine learning) and bidirectional long short-term memory neural network (BiLSTM)
(deep learning). Initially, the total corpus (8993) is divided (80:20) into training (7194) and test (1799)
datasets to train and test both the Naive Bayes and BiLSTM models. The performance parameters
Accuracy, Precision, Recall, and F1-score are evaluated for the Naive Bayes, BiLSTM, and NBLex Lexicon
models.
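
The split sizes and the four performance parameters can be reproduced with the short sketch below. The labels and predictions are made up for illustration, and treating label 1 (Positive) as the positive class is an assumption, since the averaging scheme behind the reported Precision, Recall and F1-score is not stated.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# 80:20 split of the 8993-comment corpus.
total = 8993
train_size = total * 80 // 100
test_size = total - train_size
print(train_size, test_size)

# Hypothetical gold labels and predictions for eight test comments.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```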
3.1. Naive Bayes

In this experiment, we applied two kinds of vectorization, bag-of-words (BOW) and term
frequency-inverse document frequency (TF-IDF), for sentiment analysis on the test dataset.
Table 1 shows the Accuracy, Precision, Recall, and F1-score
of both approaches. BOW and TF-IDF produce the same values for all the
parameters (Accuracy, Precision, Recall and F1-score).


3.2. BiLSTM 

The deep learning approach (BiLSTM) is applied to the test dataset to perform sentiment analysis.
Table 2 shows the accuracy, precision, recall, and F1-score of BiLSTM. On comparing Naive Bayes and
BiLSTM, it is observed that Naive Bayes performs better in terms of all parameters, i.e., accuracy,
precision, recall, and F1-score.


3.3. NBLex 

The study carried out the Lexicon-based analysis to predict the sentiment of the test dataset.
Table 3 shows the accuracy, precision, recall, and F1-score of the NBLex Lexicon approach. Table 4 shows the
precision, recall, and F1-score for all three approaches, i.e., NBLex, Naive Bayes, and BiLSTM. The
Lexicon-based approach produces better results in terms of precision, recall and F1-score in comparison
with Naive Bayes and BiLSTM.

In Table 5, the accuracy of the three approaches (NBLex, Naive Bayes, and BiLSTM) is compared.
From Figure 2 it is observed that the Lexicon-based sentiment prediction approach produces better results in
terms of Accuracy, with 83.2%, in comparison with Naive Bayes (80.7%) and BiLSTM (75.7%). The main reason for the higher
accuracy of NBLex is that the corpus-based Lexicon approach is better at dealing with OOV words in code-switch
text, since there is no limit on its vocabulary.


Table 1. Accuracy, precision, recall and F1-score of BOW and TF-IDF
Machine learning vectorization    Accuracy (%)    Precision    Recall    F1-score
BOW                               80.7            0.80         0.83      0.81
TF-IDF                            80.7            0.80         0.83      0.81


Table 2. Accuracy, precision, recall and F1-score of BiLSTM
Deep learning approach    Accuracy (%)    Precision    Recall    F1-score
BiLSTM                    75.7            0.75         0.77      0.75


Table 3. Accuracy, precision, recall and F1-score of NBLex
Lexicon    Accuracy (%)    Precision    Recall    F1-score
NBLex      83.2            0.82         0.85      0.83


Table 4. Comparative analysis of precision, recall and F1-score
Approach       Precision    Recall    F1-score
NBLex          0.82         0.85      0.83
Naive Bayes    0.80         0.83      0.81
BiLSTM         0.75         0.77      0.75


Table 5. Comparison of accuracy
Approach       Accuracy (%)
NBLex          83.2
Naive Bayes    80.7
BiLSTM         75.7

[Figure 2: bar chart; x-axis: Approach (NBLex, Naive Bayes, BiLSTM); y-axis: Accuracy, ticks from 70 to 85]

Figure 2. Accuracy comparison of NBLex, Naive Bayes and BiLSTM
4. CONCLUSION

This work proposes a Lexicon-based approach for sentiment analysis of Kannada-English
code-switch text. This approach performs better in comparison with other existing models like Naive Bayes
(machine learning) and BiLSTM (deep learning). The proposed approach (NBLex) has achieved an Accuracy of
83.2% and an F1-score of 83% for Kannada-English code-switch text. We strongly believe that the Lexicon
method is a viable alternative for performing sentiment analysis on code-switch data. To the best of our knowledge, this is the
first work aimed at sentiment analysis of Kannada-English code-switch text. In the future, we
plan to handle sarcastic text when predicting sentiment and emotion, since many users express
both positive and negative sentiments or emotions in a single comment.